Abstract

Emphatic algorithms are temporal-difference learning algorithms that change their effective
state distribution by selectively emphasizing and de-emphasizing their updates on
different time steps. Recent works by Sutton, Mahmood and White (2015), and Yu (2015) 
show that by varying the emphasis in a particular way, these algorithms become stable and 
convergent under off-policy training with linear function approximation. This paper serves 
as a unified summary of the available results from both works. In addition, we demonstrate 
the empirical benefits from the flexibility of emphatic algorithms, including state-dependent 
discounting, state-dependent bootstrapping, and the user-specified allocation of function 
approximation resources. 
1. Introduction

A fundamental problem in reinforcement learning involves learning a sequence of long-term 
predictions in a dynamical system. This problem is often formulated as learning approximations
to value functions of Markov decision processes (Bertsekas & Tsitsiklis 1996, Sutton
& Barto 1998). Temporal-difference learning algorithms, such as TD(λ) (Sutton 1988),
GQ(λ) (Maei & Sutton 2010), and LSTD(λ) (Boyan 1999, Bradtke & Barto 1996), provide
effective solutions to this problem. These algorithms stand out particularly because of their
ability to learn efficiently on a moment-by-moment basis using memory and per-step
computation that are constant in time. These methods are also distinguished by their
ability to learn from other predictions, a technique known as bootstrapping, which often
provides faster and more accurate answers (Sutton 1988).

TD algorithms conventionally make updates at every state visited, implicitly giving 
higher importance, in terms of function-approximation resources, to states that are visited 
more frequently. As the value cannot be estimated accurately under function approximation, 
valuing some states more means valuing others less. We may, however, be interested in 
valuing some states more than others based on criteria other than visitation frequency. 
Conventional TD updates do not provide that flexibility and cannot be naively modified. 
For example, in the case of off-policy TD updates, updating according to one policy while 
learning about another can cause divergence (Baird 1995). 

In this paper, we discuss emphatic TD(λ) (Sutton et al. 2015), a principled solution
for the problem of selective updating, where convergence is ensured under an arbitrary
interest in visited states as well as off-policy training. The idea is to emphasize and
de-emphasize state updates with user-specified interest in conjunction with how much other
states bootstrap from that state. We first describe this idea in a simpler case: linear
function approximation with full bootstrapping (i.e., λ = 0). We then derive the full
algorithm for the more general off-policy learning setting with arbitrary bootstrapping.
Finally, after briefly summarizing the available results on the stability and convergence of
the new algorithm, we discuss the use and the potential advantages of this algorithm using
an illustrative experiment.

2. The problem of selective updates 

Let us start with the problem of selective updating in the simplest function approximation
case: linear TD(λ) with $\lambda = 0$. Consider a Markov decision process (MDP) with a finite
set $\mathcal{S}$ of states and a finite set $\mathcal{A}$ of actions, for the discounted total reward criterion
with discount rate $\gamma \in [0,1)$. In this setting, an agent interacts with the environment by
taking an action $A_t \in \mathcal{A}$ at state $S_t \in \mathcal{S}$ according to a policy $\pi : \mathcal{A} \times \mathcal{S} \to [0,1]$, where
$\pi(a|s) \doteq P\{A_t = a \mid S_t = s\}$,¹ transitions to state $S_{t+1} \in \mathcal{S}$, and receives reward $R_{t+1} \in \mathbb{R}$ in
a sequence of time steps $t \ge 0$. Let $P_\pi \in \mathbb{R}^{N \times N}$ denote the state transition probability
matrix and $r_\pi \in \mathbb{R}^N$ the expected immediate rewards from each state under $\pi$, where $N$ is
the number of states. The value of a state is then defined as:

$$v_\pi(s) \doteq E_\pi[G_t \mid S_t = s], \tag{1}$$

where $E_\pi[\cdot]$ denotes an expectation conditional on all actions being selected according to $\pi$,
and $G_t$, the return at time $t$, is a random variable of the future outcome:

$$G_t \doteq R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots. \tag{2}$$

We approximate the value of a state as a linear function of its features: $\theta^\top \phi(s) \approx v_\pi(s)$,
where $\phi(s) \in \mathbb{R}^n$ is the feature vector corresponding to state $s$. Conventional linear TD(0)
learns the value function $v_\pi$ by generating a sequence of parameter vectors $\theta_t \in \mathbb{R}^n$:

$$\theta_{t+1} \doteq \theta_t + \alpha \left( R_{t+1} + \gamma \theta_t^\top \phi(S_{t+1}) - \theta_t^\top \phi(S_t) \right) \phi(S_t), \tag{3}$$

where $\alpha > 0$ is a step-size parameter.

Additionally, we may have a relative interest in each state, denoted by a nonnegative
interest function $i : \mathcal{S} \to [0, \infty)$. For example, in episodic problems we often care primarily
about the value of the first state, or of earlier states generally (Thomas 2014). A
straightforward way to incorporate the relative interests into TD(0) would be to use $i(S_t)$ as a
factor to the update on each state $S_t$:

$$\theta_{t+1} \doteq \theta_t + \alpha \left( R_{t+1} + \gamma \theta_t^\top \phi(S_{t+1}) - \theta_t^\top \phi(S_t) \right) i(S_t) \phi(S_t). \tag{4}$$

To illustrate the problem with this approach, suppose there is a Markov chain
consisting of two non-terminal states and a terminal state, with features $\phi(1) = 1$ and $\phi(2) = 2$
and interests $i(1) = 1$ and $i(2) = 0$ (cf. Tsitsiklis & Van Roy 1996).
Then the estimated values are $\theta$ and $2\theta$ for a scalar parameter $\theta \in \mathbb{R}$. Suppose that $\theta$ is 10 and
the reward on the first transition is 0. The transition is then from a state valued at 10 to a
state valued at 20. If $\gamma = 1$ and $\alpha$ is 0.1, then $\theta$ will be increased to 11. But then the next
time the transition occurs there will be an even bigger increase in value, from 11 to 22, and
a bigger increase in $\theta$, to 12.1. If this transition is experienced repeatedly on its own, then
the system is unstable and the parameter increases without bound; that is, it diverges.

1. The notation $\doteq$ indicates an equality by definition.

This problem arises due to both bootstrapping and the use of function approximation, 
which entails shared resources among the states. If a tabular representation was used 
instead, the value of each state would be stored independently and divergence would not 
occur. Likewise, if the value estimate of the first state was updated without bootstrapping 
from that of the second state, such divergence could again be avoided. 
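The divergence in the two-state example can be reproduced in a few lines. The following is a minimal sketch, assuming the scalar setting described above (features 1 and 2, interest 1 on the first state, zero reward, $\gamma = 1$, $\alpha = 0.1$); the function names are hypothetical and chosen only for illustration.

```python
def interest_weighted_td0_step(theta, phi_s, phi_next, reward, interest, alpha, gamma):
    """One update of the interest-weighted TD(0) rule (4) for a scalar parameter."""
    td_error = reward + gamma * theta * phi_next - theta * phi_s
    return theta + alpha * td_error * interest * phi_s


def repeat_first_transition(theta, n_updates, alpha=0.1, gamma=1.0):
    """Apply the state-1 -> state-2 transition repeatedly and record theta."""
    history = [theta]
    for _ in range(n_updates):
        theta = interest_weighted_td0_step(
            theta, phi_s=1.0, phi_next=2.0, reward=0.0,
            interest=1.0, alpha=alpha, gamma=gamma,
        )
        history.append(theta)
    return history


if __name__ == "__main__":
    # Each update multiplies theta by (1 + alpha), so the sequence grows geometrically.
    for step, value in enumerate(repeat_first_transition(10.0, 5)):
        print(step, round(value, 4))
```

Because the second state has zero interest, its own update never counteracts this growth, which is why the parameter keeps increasing.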

Emphatic TD(0) (Sutton et al. 2015) remedies this problem of TD(0) by emphasizing
the update of a state, depending on how much a state is bootstrapped in conjunction with
the relative interest in that state. Although $\lambda = 0$ gives full bootstrapping, the amount of
bootstrapping is still modulated by $\gamma$. For example, if $\gamma = 0$, then no bootstrapping occurs
even with $\lambda = 0$. The amount of emphasis to the update of a state at time $t$ is:

$$F_t \doteq i(S_t) + \gamma i(S_{t-1}) + \gamma^2 i(S_{t-2}) + \cdots + \gamma^t i(S_0) = i(S_t) + \gamma F_{t-1}. \tag{5}$$

The following update defines emphatic TD(0):

$$\theta_{t+1} \doteq \theta_t + \alpha \left( R_{t+1} + \gamma \theta_t^\top \phi(S_{t+1}) - \theta_t^\top \phi(S_t) \right) F_t \phi(S_t).$$

According to this algorithm, the value estimate of a state is updated if the user is interested 
in that state or it is reachable from another state in which the user is interested. Going 
back to the above two-state example, the second state value is now also updated despite 
having a user-specified interest of 0. In fact, $F_t$ is equal for both states, and updating
is exactly equivalent to on-policy sampling; hence, divergence does not occur. For other 
choices of relative interest and discount rate, the effective state distribution can be different 
than on-policy sampling, but the algorithm still converges as we show later. 
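The effect of the emphasis on the two-state example can be checked numerically. The sketch below compares the interest-weighted update with emphatic TD(0) over whole episodes (state 1, state 2, terminal). It assumes, beyond what the text specifies, that the reward on the second transition is 0 and that the terminal state has value 0; the function names are hypothetical.

```python
def run_episode(theta, use_emphasis, alpha=0.1, gamma=1.0):
    """Run one episode state 1 -> state 2 -> terminal and return the new theta.

    Assumptions (not stated in the text): the reward on the second
    transition is 0 and the terminal state has feature 0, hence value 0.
    """
    features = [1.0, 2.0, 0.0]   # phi(1), phi(2), terminal
    interests = [1.0, 0.0]       # i(1), i(2)
    followon = 0.0               # F before the first step
    for t in range(2):
        followon = interests[t] + gamma * followon            # equation (5)
        weight = followon if use_emphasis else interests[t]   # F_t versus i(S_t)
        td_error = 0.0 + gamma * theta * features[t + 1] - theta * features[t]
        theta = theta + alpha * td_error * weight * features[t]
    return theta


def run_episodes(theta, n_episodes, use_emphasis):
    """Repeat run_episode n_episodes times."""
    for _ in range(n_episodes):
        theta = run_episode(theta, use_emphasis)
    return theta


if __name__ == "__main__":
    print("interest-weighted:", run_episodes(10.0, 50, use_emphasis=False))
    print("emphatic:         ", run_episodes(10.0, 50, use_emphasis=True))
```

Under these assumptions the emphatic variant also updates the second state, so the parameter shrinks toward zero (the correct value when all rewards are zero), whereas the interest-weighted variant keeps growing.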

3. ETD(λ): The off-policy emphatic TD(λ)

In this section, we develop the emphatic TD algorithm, which we call ETD(λ), in the generic
setting of off-policy training with state-dependent discounting and bootstrapping.

Let $\gamma : \mathcal{S} \to [0,1]$ be the state-dependent degree of discounting; equivalently, $1 - \gamma(s)$
is the probability of terminating upon arrival in state $s$. Let $\lambda : \mathcal{S} \to [0,1]$ denote a
state-dependent degree of bootstrapping; in particular, $1 - \lambda(s)$ determines the degree
of bootstrapping upon arriving in state $s$. As notational shorthand, we use $\gamma_t \doteq \gamma(S_t)$,
$\lambda_t \doteq \lambda(S_t)$, and $\phi_t \doteq \phi(S_t)$. For TD learning, we define a general notion of bootstrapped
return, the λ-return, with state-dependent bootstrapping and discounting, by

$$G_t^\lambda \doteq R_{t+1} + \gamma_{t+1} \left( (1 - \lambda_{t+1}) \theta_t^\top \phi_{t+1} + \lambda_{t+1} G_{t+1}^\lambda \right).$$

This return can be directly used to estimate $v_\pi$ on-policy as long as the agent follows
$\pi$. However, in off-policy learning, experience is generated by following a different policy
$\mu : \mathcal{A} \times \mathcal{S} \to [0,1]$, often called the behavior policy. To obtain an unbiased estimate of
the return under $\pi$, the experience generated under $\mu$ has to be reweighted by importance
sampling ratios: $\rho_t \doteq \pi(A_t|S_t) / \mu(A_t|S_t)$, assuming $\mu(a|s) > 0$ for every state and action for which
$\pi(a|s) > 0$. The importance-sampled λ-return for off-policy learning is thus defined as
follows (Maei 2011, van Hasselt et al. 2014):

$$G_t^{\lambda\rho} \doteq \rho_t \left( R_{t+1} + \gamma_{t+1} \left( (1 - \lambda_{t+1} \rho_{t+1}) \theta_t^\top \phi_{t+1} + \lambda_{t+1} G_{t+1}^{\lambda\rho} \right) \right).$$

The forward-view update of the conventional off-policy TD(λ) can be written as:

$$\theta_{t+1} \doteq \theta_t + \alpha \left( G_t^{\lambda\rho} - \rho_t \phi_t^\top \theta_t \right) \phi_t. \tag{6}$$

The backward-view update with an offline equivalence (cf. van Seijen & Sutton 2014) with
the above forward view can be written as:

$$\theta_{t+1} \doteq \theta_t + \alpha \left( R_{t+1} + \gamma_{t+1} \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t \right) e_t, \tag{7}$$

$$e_t \doteq \rho_t \left( \gamma_t \lambda_t e_{t-1} + \phi_t \right), \text{ with } e_{-1} = 0, \tag{8}$$

where $e_t \in \mathbb{R}^n$ is the eligibility-trace vector at time $t$. This algorithm makes an update to
each state visited under $\mu$ and does not allow user-specified relative interests to different
states. Convergence is also not guaranteed in general for this update rule.
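As a rough illustration of the backward view (7)-(8), the following sketch performs one step of conventional off-policy TD(λ) with NumPy. The function name and argument layout are assumptions made for this example; `gamma` and `lam` stand for $\gamma_t$ and $\lambda_t$ of the current state, and `gamma_next` for $\gamma_{t+1}$.

```python
import numpy as np


def off_policy_td_lambda_step(theta, trace, phi, phi_next, reward,
                              rho, gamma, lam, gamma_next, alpha):
    """One backward-view update of conventional off-policy TD(lambda).

    rho is the importance sampling ratio pi(A_t|S_t) / mu(A_t|S_t).
    Returns the updated parameter vector and eligibility trace.
    """
    trace = rho * (gamma * lam * trace + phi)                        # equation (8)
    td_error = reward + gamma_next * theta @ phi_next - theta @ phi
    theta = theta + alpha * td_error * trace                         # equation (7)
    return theta, trace


if __name__ == "__main__":
    theta, trace = np.zeros(2), np.zeros(2)   # the trace before the first step is 0
    theta, trace = off_policy_td_lambda_step(
        theta, trace, phi=np.array([1.0, 0.0]), phi_next=np.array([0.0, 1.0]),
        reward=1.0, rho=2.0, gamma=0.9, lam=0.5, gamma_next=0.9, alpha=0.1,
    )
    print(theta, trace)
```

Note that every visited state receives the same unit weight in the trace here; there is no place for an interest function.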

By contrast, instead of (6), we define the forward view of ETD(λ) to be:

$$\theta_{t+1} \doteq \theta_t + \alpha \left( G_t^{\lambda\rho} - \rho_t \phi_t^\top \theta_t \right) M_t \phi_t. \tag{9}$$

Here $M_t \in \mathbb{R}$ denotes the emphasis given to the update at time $t$, and it is derived based on
the following reasoning, similar to the derivation of $F_t$ for emphatic TD(0).

The emphasis on the update at state $S_t$ is, first and foremost, due to $i(S_t)$, the inherent
interest of the user in that state. A portion of the emphasis is also due to the amount of
bootstrapping the preceding state $S_{t-1}$ does from $S_t$, determined by $\gamma_t (1 - \lambda_t) \rho_{t-1}$: the
probability of not terminating at $S_t$ times the probability of bootstrapping at $S_t$ times the
degree by which the preceding transition is followed under the target policy. Finally, $M_t$
also depends on $M_{t-1}$, the emphasis of the preceding state itself. The emphasis for state $S_t$
similarly depends on all the preceding states that bootstrap from this state to some extent.
Thus the total emphasis can be written as:

$$M_t \doteq i(S_t) + \sum_{k=0}^{t-1} M_k \rho_k \left( \prod_{i=k+1}^{t-1} \gamma_i \lambda_i \rho_i \right) \gamma_t (1 - \lambda_t) = \lambda_t i(S_t) + (1 - \lambda_t) F_t, \tag{10}$$

$$\text{where } F_t \doteq i(S_t) + \gamma_t \sum_{k=0}^{t-1} \rho_k M_k \prod_{i=k+1}^{t-1} \gamma_i \lambda_i \rho_i = i(S_t) + \gamma_t \rho_{t-1} F_{t-1}, \text{ with } F_{-1} = 0, \tag{11}$$

giving the final update for ETD(λ), derived from the forward-view update (9):

$$\theta_{t+1} \doteq \theta_t + \alpha \left( R_{t+1} + \gamma_{t+1} \theta_t^\top \phi_{t+1} - \theta_t^\top \phi_t \right) e_t, \tag{12}$$

$$e_t \doteq \rho_t \left( \gamma_t \lambda_t e_{t-1} + M_t \phi_t \right), \text{ with } e_{-1} = 0. \tag{13}$$

The trace $F_t$ here is similar to that of emphatic TD(0), adapted to the off-policy case through
the application of $\rho_t$. According to (10), the emphasis $M_t$ can be written simply as a linear
interpolation between $i(S_t)$ and $F_t$. The per-step computational and memory complexity
of ETD(λ) is the same as that of the original TD(λ): $O(n)$ in the number of features. The
additional cost ETD(λ) incurs due to the computation of the scalar emphasis is negligible.
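The recursions (10)-(13) translate almost line by line into code. The following is a minimal sketch of one ETD(λ) step with NumPy, not a reference implementation; the function name, the argument layout, and the convention of passing the previous importance sampling ratio as `rho_prev` are assumptions made for this example.

```python
import numpy as np


def etd_lambda_step(theta, trace, followon, rho_prev, phi, phi_next, reward,
                    rho, interest, gamma, lam, gamma_next, alpha):
    """One ETD(lambda) update.

    followon is F_{t-1} (0 before the first step) and rho_prev is rho_{t-1}
    (its value is irrelevant on the first step because F_{-1} = 0).
    gamma and lam are gamma_t and lambda_t; gamma_next is gamma_{t+1}.
    """
    followon = interest + gamma * rho_prev * followon          # equation (11)
    emphasis = lam * interest + (1.0 - lam) * followon         # equation (10)
    trace = rho * (gamma * lam * trace + emphasis * phi)       # equation (13)
    td_error = reward + gamma_next * theta @ phi_next - theta @ phi
    theta = theta + alpha * td_error * trace                   # equation (12)
    return theta, trace, followon


if __name__ == "__main__":
    theta, trace, followon = np.zeros(2), np.zeros(2), 0.0
    theta, trace, followon = etd_lambda_step(
        theta, trace, followon, rho_prev=1.0,
        phi=np.array([1.0, 0.0]), phi_next=np.array([0.0, 1.0]), reward=1.0,
        rho=1.0, interest=1.0, gamma=0.9, lam=0.0, gamma_next=0.9, alpha=0.1,
    )
    print(theta, trace, followon)
```

Only two extra scalars (`followon` and `emphasis`) are maintained on top of the usual trace, which is consistent with the $O(n)$ per-step cost noted above.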
4. Stability and convergence of ETD(λ)

We have discussed the motivations and ideas that led to the design of the emphasis weighting
scheme (10)-(13) for ETD(λ). We now discuss several salient analytical properties
underlying the algorithm due to this weighting scheme, and present the key stability and
convergence results we have obtained for the algorithm. First, we formally state the
conditions needed for the analysis.

Assumption 1 (Conditions on the target and behavior policies)

(i) The target policy $\pi$ is such that $(I - P_\pi \Gamma)^{-1}$ exists, where $\Gamma$ is the $N \times N$ diagonal
matrix with the state-dependent discount factors $\gamma(s), s \in \mathcal{S}$, as its diagonal entries.

(ii) The behavior policy $\mu$ induces an irreducible Markov chain on $\mathcal{S}$, with the unique
invariant distribution $d_\mu(s), s \in \mathcal{S}$, and for all $(s,a) \in \mathcal{S} \times \mathcal{A}$, $\mu(a|s) > 0$ if $\pi(a|s) > 0$.

Under Assumption 1(i), the value function $v_\pi$ is specified by the expected total
(discounted) rewards as $v_\pi = (I - P_\pi \Gamma)^{-1} r_\pi$; i.e., $v_\pi$ is the unique solution of the Bellman
equation $v = r_\pi + P_\pi \Gamma v$. Associated with ETD(λ) is a multistep, generalized Bellman
equation which is determined by the bootstrapping parameters $\lambda(s)$ and also has $v_\pi$ as its
unique solution (Sutton 1995):

$$v = r_\pi^\lambda + P_\pi^\lambda v, \tag{14}$$

where $P_\pi^\lambda$ is a substochastic matrix and $r_\pi^\lambda \in \mathbb{R}^N$.² Let $\Phi$ be the $N \times n$ matrix with the
feature vectors $\phi(s)^\top, s \in \mathcal{S}$, as its rows. The goal of ETD(λ) is to find an approximate
solution of the Bellman equation (14) in the space $\{\Phi\theta \mid \theta \in \mathbb{R}^n\}$.

Let us call those states on which ETD(λ) places positive emphasis weights emphasized
states. More precisely, under Assumption 1(ii), we can assign an expected emphasis weight
$m(s)$ for each state $s$, according to the weighting scheme (10)-(13), as (Sutton et al. 2015):

$$[m(1), m(2), \ldots, m(N)] = d_{\mu,i}^\top (I - P_\pi^\lambda)^{-1}, \tag{15}$$

where $d_{\mu,i} \in \mathbb{R}^N$ denotes the vector with components $d_{\mu,i}(s) = d_\mu(s) \cdot i(s), s \in \mathcal{S}$.
Emphasized states are precisely those with $m(s) > 0$. It is important to observe from (15) that the
emphasis weights $m(s)$ reflect the occupancy probabilities of the target policy, with respect
to $P_\pi^\lambda$ and an initial distribution proportional to $d_{\mu,i}$, rather than the behavior policy. As
will be seen shortly, this gives ETD(λ) a desired stability property that is normally lacking in
TD(λ) algorithms with selective updating.

Let $M$ denote the diagonal matrix with the emphasis weights $m(s)$ on its diagonal. By
considering the stationary case, the equation that ETD(λ) aims to solve is shown by Sutton
et al. (2015) to be:

$$A\theta = b, \quad \theta \in \mathbb{R}^n, \tag{16}$$

$$\text{where } A = \Phi^\top M (I - P_\pi^\lambda) \Phi, \quad b = \Phi^\top M r_\pi^\lambda. \tag{17}$$

In terms of the approximate value function $v = \Phi\theta$, under a mild condition on the
approximation architecture given below, the equation (16) is equivalent to a projected version of
the Bellman equation (14):

$$v = \Pi (r_\pi^\lambda + P_\pi^\lambda v), \quad v \in \{\Phi\theta \mid \theta \in \mathbb{R}^n\}, \tag{18}$$

where $\Pi$ denotes projection onto the approximation subspace with respect to a weighted
Euclidean norm or seminorm $\|\cdot\|_m$, defined by the emphasis weights as $\|v\|_m^2 = \sum_{s \in \mathcal{S}} m(s) v(s)^2$.

2. Specifically, with $\Lambda$ denoting the diagonal matrix with $\lambda(s), s \in \mathcal{S}$, as its diagonal entries, we have
$P_\pi^\lambda = I - (I - P_\pi \Gamma \Lambda)^{-1} (I - P_\pi \Gamma)$ and $r_\pi^\lambda = (I - P_\pi \Gamma \Lambda)^{-1} r_\pi$.

Assumption 2 (Condition on the approximation architecture) 

The set of feature vectors of emphasized states, $\{\phi(s) \mid s \in \mathcal{S}, m(s) > 0\}$, contains $n$ linearly
independent vectors.

We note that Assumption 2 (which implies the linear independence of the columns
of $\Phi$) is satisfied in particular if the set of feature vectors, $\{\phi(s) \mid s \in \mathcal{S}, i(s) > 0\}$,
contains $n$ linearly independent vectors, since states with positive interest $i(s)$ are among
the emphasized states. So this assumption can be easily satisfied in reinforcement learning
without model knowledge.

We are now ready to discuss an important stability property underlying our algorithm.
By making the emphasis weights $m(s)$ reflect the occupancy probabilities of the target
policy, as discussed earlier, the weighting scheme (10)-(13) of our algorithm ensures that
the matrix $A$ is positive definite under almost minimal conditions for off-policy training:³

Theorem 1 (Stability property of A) Under Assumptions 1-2, the matrix $A$ is positive
definite (that is, there exists $c > 0$ such that $\theta^\top A \theta \ge c \|\theta\|_2^2$ for all $\theta \in \mathbb{R}^n$).

This property of $A$ shows that the equation (16) associated with ETD(λ) has a unique
solution $\theta^*$ (equivalently, the equation (18) has the approximate value function $v = \Phi\theta^*$ as
its unique solution). Moreover, it shows that unlike normal TD(λ) with selective updating,
here the deterministic update in the parameter space, $\theta_{t+1} = \theta_t - \alpha (A\theta_t - b)$, converges
to $\theta^*$ for sufficiently small stepsize $\alpha$, and when diminishing stepsizes $\{\alpha_t\}$ are used in
ETD(λ), $\{\theta^*\}$ is globally asymptotically stable for the associated “mean ODE” $\dot{\theta} = -A\theta + b$
(Kushner & Yin 2003).⁴ We are now ready to address the convergence of the algorithm.
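For a small model that is fully known, these quantities can be computed directly. The sketch below builds $P_\pi^\lambda$ and $r_\pi^\lambda$ from footnote 2, the emphasis weights from (15), and $A$ and $b$ from (17), then runs the deterministic update. The three-state example, the function names, and the particular stepsize rule (smallest eigenvalue of the symmetric part of $A$ divided by the squared spectral norm, one sufficient choice that makes the iteration a contraction) are all assumptions introduced for illustration.

```python
import numpy as np


def stationary_distribution(P):
    """Solve d^T P = d^T with sum(d) = 1 by least squares."""
    n = P.shape[0]
    lhs = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    rhs = np.append(np.zeros(n), 1.0)
    return np.linalg.lstsq(lhs, rhs, rcond=None)[0]


def etd_key_matrices(P_pi, P_mu, gammas, lambdas, interest, Phi, r_pi):
    """Return (m, A, b) following footnote 2 and equations (15)-(17)."""
    I = np.eye(P_pi.shape[0])
    PG = P_pi @ np.diag(gammas)
    PGL = PG @ np.diag(lambdas)
    P_lam = I - np.linalg.solve(I - PGL, I - PG)
    r_lam = np.linalg.solve(I - PGL, r_pi)
    d_mu_i = stationary_distribution(P_mu) * interest
    m = np.linalg.solve((I - P_lam).T, d_mu_i)   # transposed form of (15)
    A = Phi.T @ np.diag(m) @ (I - P_lam) @ Phi
    b = Phi.T @ np.diag(m) @ r_lam
    return m, A, b


def mean_iteration(A, b, n_steps):
    """Iterate theta <- theta - alpha (A theta - b) with a conservative alpha."""
    sym_min = np.linalg.eigvalsh((A + A.T) / 2.0).min()
    alpha = sym_min / np.linalg.norm(A, 2) ** 2
    theta = np.zeros(A.shape[0])
    for _ in range(n_steps):
        theta = theta - alpha * (A @ theta - b)
    return theta


def example_problem():
    """A hypothetical 3-state, 2-feature problem (all numbers are made up)."""
    return dict(
        P_pi=np.array([[0.1, 0.6, 0.3], [0.4, 0.1, 0.5], [0.5, 0.3, 0.2]]),
        P_mu=np.array([[0.4, 0.3, 0.3], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]),
        gammas=np.array([0.9, 0.9, 0.9]),
        lambdas=np.array([0.0, 0.5, 1.0]),
        interest=np.array([1.0, 0.0, 0.5]),
        Phi=np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]),
        r_pi=np.array([1.0, 0.0, 2.0]),
    )


if __name__ == "__main__":
    m, A, b = etd_key_matrices(**example_problem())
    print("emphasis weights:", m)
    print("min eigenvalue of symmetric part:", np.linalg.eigvalsh((A + A.T) / 2).min())
    print("theta* :", np.linalg.solve(A, b))
    print("iterate:", mean_iteration(A, b, 2000))
```

Positive definiteness of the nonsymmetric matrix $A$ is checked here through the smallest eigenvalue of its symmetric part, since $\theta^\top A \theta = \theta^\top \tfrac{1}{2}(A + A^\top) \theta$.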

Assumption 3 (Conditions on noisy rewards and diminishing stepsizes)

(i) The variances of the random rewards $\{R_t\}$ are bounded.

(ii) The (deterministic) stepsizes $\{\alpha_t\}$ satisfy that $\alpha_t = O(1/t)$ and $(\alpha_t - \alpha_{t+1})/\alpha_t = O(1/t)$.

Under the preceding assumptions, we have the following result, proved in (Yu 2015):⁵

3. The conclusion of Theorem 1 for the case of an interest function $i(\cdot) > 0$ is first proved by Sutton,
Mahmood, and White (see their Theorem 1); Theorem 1 as given here is proved by Yu (2015) (see Prop.
C.2 and Remark C.2 in Appendix C therein). The analyses in both works are motivated by a proof idea
of Sutton (1988), which is to analyze the structure of the $N \times N$ matrix $M(I - P_\pi^\lambda)$ and to invoke a result
from matrix theory on strictly or irreducibly diagonally dominant matrices (Varga 2000, Cor. 1.22).

4. The important analytical properties discussed here can be shown to also extend to the case where the
linear independence condition in Assumption 2 is relaxed: there, $A$ acts like a positive definite matrix
on the subspace of $\theta$ (the range space of $A$) that ETD(λ) naturally operates on. These extensions are
based on both our understanding of how the weighting scheme (10)-(13) is designed (Sutton et al. 2015)
and the special structure of the matrix $M(I - P_\pi^\lambda)$ revealed in the proof of (Yu 2015, Prop. C.2). We
will report the details of these extensions in a separate paper, however.

5. The proof is similar to but more complex than the convergence proof for off-policy LSTD/TD (Yu
2012). Among others, we show that despite the high variance in off-policy learning, the Markov chain
$\{(S_t, A_t, e_t, F_t)\}$ on the joint space $\mathcal{S} \times \mathcal{A} \times \mathbb{R}^{n+1}$ exhibits nice properties including ergodicity. We use these
properties together with convergence results for a least-squares version of ETD(λ) and a convergence
theorem from stochastic approximation theory (Kushner & Yin 2003, Theorem 6.1.1) to establish the
desired convergence of ETD(λ) and its constrained variant by a “mean ODE” based proof method.
Theorem 2 (Convergence of ETD(λ))

Let Assumptions 1-3 hold. Then, for each initial $\theta_0 \in \mathbb{R}^n$, the sequence $\{\theta_t\}$ generated by
ETD(λ) converges to $\theta^*$ with probability 1.

To satisfy the stepsize Assumption 3(ii), we can take $\alpha_t = c_1/(c_2 + t)$ for some constants
$c_1, c_2 > 0$, for example. If the behavior policy is close to the target policy, we believe that
ETD(λ) also converges for larger stepsizes.


5. An illustrative experiment

In this section we describe an experiment to illustrate the flexibility and benefits of
ETD(λ) in learning several off-policy predictions in terms of value estimates.

In this experiment we used a gridworld problem depicted in Figure 1, which we call the
Miner problem. Here a miner starting from the cell S continually wandered around the
gridworld using one of the following actions: left, right, up and down, each indicating the
direction of the miner's movement. An invalid direction such as going down from S resulted
in no movement. The miner got zero reward at every transition except when it arrived at the
cell denoted by Gold, in which case a +1 reward was obtained. There were two routes to reach
the Gold cell from S: one went straight up through Block D, and the other was roundabout
through Block B. A trap could be activated in one of the two cells in Block D chosen
randomly. Once active, a trap stayed for 3 time steps, and only one trap was active at any
time. The trap activation probability was 0.25. If the miner arrived at the Gold cell or fell
into a trap, it was transported to S in the next time step. Note that arriving at the Gold
cell or a trap was not the end of an episode, and the miner wandered around continually.

Figure 1. The Miner problem where a miner continually collects gold from the Gold cell
until it falls into a trap, which can be activated in Block D.

The miner followed a fixed behavior policy according to which the miner was equally
likely to take any of the four actions in Block A, more inclined to go up in both Block B
and Block D, and more inclined to go left in Block C, in each case with probability 0.4.
The rest of the actions were equally likely.

We evaluated three fixed policies different from the behavior policy. We call them
uniform, headfirst and cautious policies. Under the uniform policy, all actions were
equally likely in every cell. Under the headfirst policy, the miner chose to go up in Block
A and D with 0.9 probability while other actions from those blocks were equally likely. All
the actions from other blocks were chosen with equal probability. Under the cautious
policy, the miner was more inclined to go right in Block A, go up in both Block B and
Block D, and go left in Block C, in each case with probability 0.6. The rest of the actions
were equally likely.

We were interested in predicting how much gold the miner could collect before falling into a
trap if the miner had used the above three policies, without executing any of these policies.
We set $\gamma = 0$ for those states where the miner got entrapped to indicate termination under
the target policy (although the behavior policy continued) and a discounting of $\gamma = 0.99$ in
other states. We set $i(s) = 1$ whenever the miner was in Block A and 0 everywhere else.
As the behavior policy of the miner is different from the three target policies, it must use
off-policy training to learn what could happen under each of those policies. We used three
instances of ETD(λ) for three different predictions, each using $\alpha = 0.001$, $\lambda = 1.0$ when
the miner was in Block D, $\lambda = 0$ in Block A, and $\lambda = 0.9$ in other states. We clipped
each component of the increment to $\theta$ in (12) between −0.5 and +0.5 in order to reduce
the impact of extremely large eligibility traces on updates. Clipping the increments can be
shown to be theoretically sound, although we will not discuss this subject here. The state
representation used four features: each corresponding to the miner being in one of the four
blocks. The miner wandered continually until 3000 entrapments occurred.
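The per-state settings and the clipped increment described above might be coded as follows. This is only a sketch of the stated configuration, not the code used in the experiment; the function names and the one-hot ordering of the block features are assumptions.

```python
import numpy as np

BLOCKS = ("A", "B", "C", "D")  # assumed ordering of the four block features


def block_features(block):
    """One-hot feature vector indicating which block the miner is in."""
    phi = np.zeros(len(BLOCKS))
    phi[BLOCKS.index(block)] = 1.0
    return phi


def miner_settings(block, entrapped):
    """Return (gamma, lambda, interest) for the current state."""
    gamma = 0.0 if entrapped else 0.99          # termination under the target policy
    lam = {"A": 0.0, "D": 1.0}.get(block, 0.9)  # state-dependent bootstrapping
    interest = 1.0 if block == "A" else 0.0     # interest only in Block A
    return gamma, lam, interest


def clipped_update(theta, td_error, trace, alpha=0.001, bound=0.5):
    """Apply the increment of (12) with each component clipped to [-bound, bound]."""
    increment = alpha * td_error * trace
    return theta + np.clip(increment, -bound, bound)


if __name__ == "__main__":
    print(block_features("C"), miner_settings("D", entrapped=False))
    # A very large trace component is limited to a step of at most 0.5.
    print(clipped_update(np.zeros(4), 2.0, np.array([1000.0, -1000.0, 1.0, 0.0])))
```

Clipping acts on the increment rather than on the trace itself, so the trace recursion (13) is left unchanged.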


Figure 2 shows estimates calculated by ETD(λ) in terms of its weight corresponding to
Block A for the three target policies. The curves shown are average estimates with two
standard error bands using 50 independent runs. The dotted straight lines indicate the true
state value estimated through Monte Carlo simulation from S. Due to the use of function
approximation and clipping of the updates, the true value could not be estimated accurately.
However, the estimates for the three policies appear to approach values close to the true
ones, and they preserved the relative ordering of the policies. In the absence of the
clipping, the estimates were less stable and highly volatile, occasionally moving far away
from the desired value for some of the runs. Although some of the learning curves still look
volatile, clipping the updates reduced the volatility considerably.

Figure 2. Simultaneous evaluation of three policies different from the behavior policy using
ETD(λ). Axes: amount of gold versus number of entrapments under the behavior policy;
curves: cautious, headfirst, and uniform policies.

6. Discussion and conclusions 

We summarized the motivations, key ideas and the available results on emphatic algorithms.
Furthermore, we demonstrated how ETD(λ) can be used to learn many predictions about
the world simultaneously using off-policy learning, and the flexibility it provides through
state-dependent discounting, bootstrapping and user-specified relative interests to states.
ETD(λ) is among the few algorithms with per-step linear computational complexity that
are convergent under off-policy training. Compared to convergent gradient-based TD
algorithms (Maei 2011), ETD(λ) is simpler and easier to use; it has only one learned parameter
vector and one step-size parameter. The problem of high variance is common in off-policy
learning, and ETD(λ) is susceptible to it as well. An extension to variance-reduction
methods, such as weighted importance sampling (Precup et al. 2000, Mahmood et al. 2014, 2015),
can be a natural remedy to this problem. ETD(λ) produces a different algorithm than the
conventional TD(λ) even in the on-policy case. It is likely that, in many cases, ETD(λ)
provides more accurate predictions than TD(λ) through the use of relative interests and
emphasis. An interesting direction for future work would be to characterize these cases.